Mining Word Senses from Text for Corpus-Based Lexicography

Authors

  • Canasai Kruengkrai
  • Thatsanee Charoenporn
  • Virach Sornlertlamvanich
  • Hitoshi Isahara
Abstract

This paper addresses the problem of automated lexicography. In the corpus-based approach, a lexicographer must manually group the contexts of a target word into clusters in order to identify its senses. When the number of contexts is large, this process becomes tedious and time-consuming. To overcome this problem, we propose an efficient technique based on unsupervised clustering. We present a spherical Gaussian EM algorithm that can be enhanced by combining it with a robust initialization method based on Principal Component Analysis. The resulting clusters provide a structure for analyzing the underlying senses of the target word found in a text corpus. Experimental results on two different data sets of polysemous words indicate that the proposed algorithm is a promising technique for corpus-based lexicography.
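
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the PCA-based initialization works, so the code below assumes one simple variant (sort the contexts by their projection onto the first principal component and cut the ordering into k equal groups). The function names `pca_init` and `spherical_gaussian_em`, the parameter defaults, and the assumption that each context is already a numeric feature vector are all illustrative.

```python
import numpy as np

def pca_init(X, k):
    """Assumed initialization (not taken from the paper): order points by
    their projection on the first principal component and cut the ordering
    into k roughly equal groups."""
    Xc = X - X.mean(axis=0)
    # The first right singular vector of the centered data is the first principal axis.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    order = np.argsort(Xc @ vt[0])
    labels = np.empty(len(X), dtype=int)
    for j, idx in enumerate(np.array_split(order, k)):
        labels[idx] = j
    return labels

def spherical_gaussian_em(X, k, n_iter=50, eps=1e-6):
    """EM for a mixture of Gaussians with one variance per cluster
    (covariance = var_j * I). Returns hard labels, means, and variances."""
    n, d = X.shape
    resp = np.eye(k)[pca_init(X, k)]  # one-hot responsibilities from the initial partition
    for _ in range(n_iter):
        # M-step: mixing weights, means, and a single variance per cluster.
        nk = resp.sum(axis=0) + eps
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        var = (resp * sq).sum(axis=0) / (d * nk) + eps
        # E-step: posterior cluster probabilities, computed in log space for stability.
        log_p = np.log(weights) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * sq / var
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), means, var
```

In a lexicographic setting, each row of `X` would be the feature vector of one context of the target word, and each resulting cluster would be a candidate sense for the lexicographer to inspect. The sketch builds an n-by-k-by-d array in the M-step, so it suits small illustrative data rather than large corpora.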

Similar resources

Co-Occurrence Patterns among Collocations: A Tool for Corpus-Based Lexical Knowledge Acquisition

One of the main problems for applied natural language processing is gaps in the lexicon, including missing words and word senses, and inadequate descriptions of word use in context. Traditional lexicography has similar concerns. The availability of large, on-line text corpora provides a straightforward tool for enlarging the stock of words included in a lexicon. The identification of additional...

Analysis Of A Hand-Tagging Task

We analyze the results of a semantic annotation task performed by novice taggers as part of the WordNet SemCor project (Landes et al., in press). Each polysemous content word in a text was matched to a sense from WordNet. Comparing the performance of the novice taggers with that of experienced lexicographers, we find that the degree of polysemy, part of speech, and the position within the WordN...

Identification of Rare & Novel Senses Using Translations in a Parallel Corpus

The identification of rare and novel senses is a challenge in lexicography. In this paper, we present a new method for finding such senses using a word aligned multilingual parallel corpus. We use the Europarl corpus and therein concentrate on French verbs. We represent each occurrence of a French verb as a high dimensional term vector. The dimensions of such a vector are the possible translati...

"I Don't Believe in Word Senses"

Word sense disambiguation assumes word senses. Within the lexicography and linguistics literature, they are known to be very slippery entities. The paper looks at problems with existing accounts of ‘word sense’ and describes the various kinds of ways in which a word’s meaning can deviate from its core meaning. An analysis is presented in which word senses are abstractions from clusters of corpu...

Publication date: 2004